Artistic Style Transfer

CNN
Deep Learning
Style Transfer
Author

Yassine Mhedhbi

Published

December 13, 2020

Neural style transfer is a method for combining two images: the content of one image and the artistic style of the other.

The idea is to start with a generated picture that is a copy of the original picture you wish to transform, compute its content loss against the original picture and its style loss against the artistic image, then update the pixels of the generated image to reduce a linear combination of the two losses. We will define the loss functions in a bit; they come from this paper.
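Condensed into a few lines, the whole procedure looks roughly like this (the feature extractor and the loss function used here are the ones built in the rest of the post; num_steps, alpha and beta are hyperparameters set later):

# rough preview of the optimization loop built step by step below
generated = original_image.clone().requires_grad_(True)   # start from the content image
optimizer = torch.optim.Adam([generated], lr=0.01)         # optimize pixels, not model weights
for step in range(num_steps):
    gen_feats = feature_extractor(generated)
    ori_feats = feature_extractor(original_image)
    sty_feats = feature_extractor(style_image)
    total_loss = loss(gen_feats, ori_feats, sty_feats, alpha, beta)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()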

Model setup

Building on the feature extractor from the previous guide, with a slight modification: we save the features instead of displaying them.

import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

class FeatureExtractor(nn.Module):
    """Runs the input through the VGG feature layers and collects the
    activations at the requested block indices."""
    def __init__(self, model, blocks):
        super().__init__()
        self.features = model.features
        self.blocks = blocks

    def forward(self, x):
        features = []
        for n, layer in enumerate(self.features):
            x = layer(x)
            if n in self.blocks:
                features.append(x)
        return features


vggmodel = models.vgg19(weights=models.VGG19_Weights.DEFAULT)

Loss

We define the loss to be a linear combination of the content loss and the style loss. \[ Loss = \alpha \cdot contentLoss + \beta \cdot styleLoss \]

where \(\alpha\) and \(\beta\) are positive weights.

We can use MSE for the content loss, since we want to measure the difference between the features extracted from the original picture and the ones from the generated picture.

Content loss at layer \(l\): \[ ContentLoss_l(gen, ori) = \frac{1}{2} \sum_{i,j} (gen_{i,j,l} - ori_{i, j, l})^2\]

def content_loss(generated_features, original_features):
    # mean squared difference between the generated and original feature maps
    return ((generated_features - original_features) ** 2).mean()

For the style loss, we calculate the Gram matrix of the generated features, which gives us the correlations between the channel/filter responses. We do the same for the style image. The loss at a layer \(l\) is: \[E_l = \frac{1}{4N_l^2M_l^2}\sum_{i,j}(G_{i,j}^l - A_{i,j}^l)^2 \] where \(N_l\) is the number of channels and \(M_l\) is the height times the width of the feature representation at layer \(l\).
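Here \(G^l\) is the Gram matrix of the generated features and \(A^l\) that of the style features; following the same paper, the Gram matrix is simply the inner product between the vectorized feature maps of layer \(l\): \[G_{i,j}^l = \sum_{k} F_{i,k}^l F_{j,k}^l \] where \(F^l\) is the \(N_l \times M_l\) matrix obtained by flattening each channel of the layer's feature map into a row.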

To get the total style loss, we sum the per-layer losses. In the code below we take the mean of the squared Gram-matrix differences at each layer rather than applying the paper's exact normalization constant.

def style_loss(generated, style):
    batch_size, channel, h, w = generated.shape
    # Gram matrices of the generated and style features (assumes batch size 1)
    G = torch.mm(generated.view(channel, h * w), generated.view(channel, h * w).t())
    A = torch.mm(style.view(channel, h * w), style.view(channel, h * w).t())
    return ((G - A) ** 2).mean()

Back to our original formula: \[ Loss = \alpha \cdot contentLoss + \beta \cdot styleLoss \]

def loss(generated_features, original_features, style_features, alpha, beta):
    s_loss, c_loss = 0, 0
    # accumulate the content and style losses over all the selected layers
    for gen, cont, style in zip(generated_features, original_features, style_features):
        c_loss += content_loss(gen, cont)
        s_loss += style_loss(gen, style)

    total_loss = alpha * c_loss + beta * s_loss
    return total_loss

Style Transfer Attempt

Let's take a look at our images:

Original image

Artistic Image: The Starry Night

Training loop

A regular training loop would adjust the model parameters based on the loss; here, however, we are adjusting the pixels of the generated image (check the input of torch.optim.Adam). We can therefore put the model in eval mode, since we are only using it to extract features and never update its parameters.

The \(\alpha\) and \(\beta\) are hyperparameters which we can manipulate if we wish to add more style or maintain more of the content.

The optimizer will therefore update the pixel values of the generated image directly at every step.
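Optionally, we can make this explicit by freezing the VGG parameters (a small extra step, not strictly required since the optimizer never sees them); gradients still flow through the activations down to the generated image:

# optional: the VGG weights are never updated, so they don't need gradients of their own
for param in vggmodel.parameters():
    param.requires_grad_(False)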

def read_image(path):
    # load an image, resize it to 512x512 and add a batch dimension
    img = Image.open(path)
    trans = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])
    data = trans(img)
    return data.unsqueeze(0)

original_image = read_image(original_path)
style_image = read_image(style_path)

# indices 0, 5, 10, 19, 28 correspond to conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 in VGG19
feature_extractor = FeatureExtractor(vggmodel, [0, 5, 10, 19, 28]).eval()
from torchvision.utils import save_image

shape = transforms.ToTensor()(Image.open(original_path)).shape


lr = 0.01
alpha = 8    # content loss weight
beta = 70    # style loss weight

def transfer_style(model, original_image, style_image, epoch=5000):
    generated_image = original_image.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([generated_image], lr=lr)
    images = []
    for e in range(epoch):
        # extract the original, generated & style features to compute the loss:
        # original and generated for the content loss, generated and style for the style loss
        original_features = model(original_image)
        generated_features = model(generated_image)
        style_features = model(style_image)

        total_loss = loss(generated_features, original_features, style_features, alpha, beta)
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

        if e % 100 == 0:
            # detach and clamp to [0, 1] before converting the tensor to a PIL image
            img = generated_image.squeeze(0).detach().clamp(0, 1)
            images.append(transforms.ToPILImage()(transforms.Resize(shape[-2:])(img)))
            #save_image(transforms.Resize(shape[-2:])(generated_image), dic/'output'/f'gen{e//100}.png')
    return images

Lastly, all we have to do is call transfer_style(feature_extractor, original_image, style_image). This returns a list of images, one captured every 100 epochs; in our case we get 50 pictures that start from the original image and gradually take on the painting's style. Here is a sample output.
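For example, the returned frames could be written to disk like this (the output folder name is just an illustration):

frames = transfer_style(feature_extractor, original_image, style_image)
for i, frame in enumerate(frames):
    # each frame is a PIL image captured every 100 epochs
    frame.save(f'output/gen_{i:03d}.png')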

Tip

Use high-quality images for better output. I have used a low-resolution photo to run this quickly, and it shows.

Output Image of Sidi Bou Said mixed with Van Gogh’s starry night

Conclusion

We have applied neural style transfer. To get different effects, we could play around with \(\alpha\) and \(\beta\), the number of epochs and, of course, the input images.

Our training loop looks very similar to a regular model training loop; however, we are not training the model here. We only use its features attribute to extract features from the images, compute the loss, and backpropagate, which updates the pixels of the generated picture.